Site Selection for Starbucks

Giulia Carella

SDSC20 Workshop - Integrating CARTOframes into Spatial Data Science workflows. October 21 2020

Context

Site selection refers to the process of deciding where to open a new store/facility, or where to relocate an existing one, by comparing the merits of potential locations. If some attribute of interest is known (e.g. the revenues of existing stores), statistical and machine learning methods can be used to support the decision process.

What we will do

In this demo, we will look at where Starbucks should open new coffee shops in Long Island, NY. Specifically, we will build a model for the revenues of the existing stores as a function of sociodemographic data and use this model to predict the expected revenue in each block group.

We will:

Conclusions

0. Set up

Set up the virtual environment

First of all, install virtualenv to create an isolated environment inside your project folder and avoid conflicts with your other packages.

pip install virtualenv

After installing it, run these commands one by one inside your root project directory:

virtualenv venv

source venv/bin/activate

Your environment is now isolated and you can install the required packages inside it:

pip install -r requirements.txt

To exit the virtual environment type:

deactivate

Import the libraries

Set CARTO credentials

Go to your CARTO dashboard and click on API KEY to get your key.
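A minimal configuration sketch (the username and API key strings below are placeholders, not real credentials):

```python
# Register default CARTO credentials for the rest of the notebook.
# Both values are placeholders: copy your own from the CARTO dashboard.
from cartoframes.auth import set_default_credentials

set_default_credentials(
    username='your-carto-username',
    api_key='your-api-key',
)
```

Once set, subsequent CARTOframes calls (uploads, Data Observatory queries, map publishing) use these credentials implicitly.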

1. Upload Starbucks store data to your CARTO account

First, we upload the Starbucks data stored locally in the starbucks_long_island.csv file. The data contain the addresses of Starbucks stores, their annual revenue ($), and some store characteristics (e.g. the number of open hours per week).
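Loading the file with pandas might look like this sketch; the column names (`address`, `revenue`) are assumptions, so adjust them to the actual headers of starbucks_long_island.csv:

```python
import pandas as pd

def load_stores(path):
    """Load the store file and coerce the revenue column to numeric.

    Assumes (hypothetically) a 'revenue' column; adjust the name to
    the actual header in starbucks_long_island.csv.
    """
    stores = pd.read_csv(path)
    stores['revenue'] = pd.to_numeric(stores['revenue'], errors='coerce')
    return stores
```

For example, `stores = load_stores('starbucks_long_island.csv')` followed by `stores['revenue'].hist()` reproduces the revenue distribution plot.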

Next, we will plot the distribution of the annual revenues by store.

2. Geocode locations

Since we only know the addresses, we first need to geocode them to obtain the corresponding geographic coordinates, which will be used to assign to each store the Census data of its block group.

Upload data to CARTO

Visualize the store locations

3. Download demographic and socioeconomic data by block group

Next, we will download from the CARTO Data Observatory (www.carto.com/data) the demographic and socioeconomic variables that we will use to build a model for the store revenues. We will use data from the American Community Survey (ACS), an ongoing survey by the U.S. Census Bureau. The ACS is available at its most granular resolution at the census block group level, with the most recent geography boundaries available for 2015. The data that we will use come from the survey run from 2013 to 2017.

Upload Long Island geometry

As we are only interested in the data for the Long Island area, we first upload the geographical boundaries of this area from the manhattan_long_island.geojson file, which is stored locally.

Look for the datasets of interest using the Data Discovery methods

More documentation here: https://carto.com/developers/cartoframes/guides/Data-discovery/

Let's start by exploring the Data Catalog to see what data categories CARTO can offer for data in the US.

Then check the available providers and geometries and select the most recent ACS data at the block group level

And explore in greater detail the metadata

Finally get the data and geometries

4. Model preparation: normalization and dimensionality reduction

First normalize extensive variables by the total population

When comparing data across irregular units like census block groups, extra care is needed for extensive variables, i.e. those whose value for a block can be viewed as a sum of sub-block values, as in the case of population. Extensive variables first need to be normalized, e.g. by dividing by the total area or the total population, depending on the application. Using the metadata available in the CARTO Data Observatory, we will check which variables can be considered extensive by looking at their default aggregation method: if the aggregation method is SUM, the variable is normalized by dividing by the total population.
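As a sketch of this rule, assume we have built a (hypothetical) mapping from each column to its Data Observatory aggregation method, and that a `total_pop` column holds the block-group population:

```python
import pandas as pd

def normalize_extensive(df, agg_method, total_col='total_pop'):
    """Divide each SUM-aggregated (extensive) column by the total population.

    agg_method is a hypothetical dict {column -> aggregation method} built
    from the Data Observatory variable metadata; total_col is assumed to
    hold the total population of each block group.
    """
    out = df.copy()
    for col, method in agg_method.items():
        if method == 'SUM' and col != total_col:
            out[col] = out[col] / out[total_col]
    return out
```

Intensive variables (e.g. medians, with aggregation method AVG) pass through unchanged.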

Any missing data?

Missing data are common in surveys such as the ACS. Before using the ACS data as covariates, we therefore need to check whether there are any missing data and decide how to impute the missing values.

For each variable, the figure below shows the data points that are missing (corresponding to the rows in white). We notice that the amount of missing observations is typically small but depends on the variable.
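A quick pandas helper to quantify the share of missing values per variable (the `acs_df` name in the usage note below is a placeholder for the downloaded ACS table):

```python
import pandas as pd

def missing_summary(df):
    """Fraction of missing values per column, worst first."""
    return df.isna().mean().sort_values(ascending=False)
```

For example, `missing_summary(acs_df).head(10)` lists the ten variables with the most missing observations.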

Are the variables correlated?

Next, we will check the pair-wise correlations between the ACS variables. Correlated covariates are problematic for inference in linear (and generalized linear) models, as they inflate the variance of the regression coefficients (https://en.wikipedia.org/wiki/Variance_inflation_factor), but they do not necessarily represent a problem for prediction. However, if correlated variables are present, we can take advantage of these correlations to reduce the dimensionality of the model and save computation time.

Looking at the pair-wise correlations of a random sample of 50 variables, we notice large correlations.

To account for missing values and reduce the model dimensionality, we will transform the data using Principal Component Analysis

To impute the missing data and reduce the model dimensionality, we will use a probabilistic formulation of Principal Component Analysis (PCA). PCA transforms data sets of high-dimensional vectors into lower-dimensional ones, finding a smaller-dimensional linear representation of the data such that the original data can be reconstructed from the compressed representation with minimum squared error.

In its probabilistic formulation, probabilistic PCA (PPCA), the complete data are modelled by a generative latent variable model. The principal components are not observed and are treated as latent variables, so the data are "incomplete"; the model is fitted by iteratively updating the expected complete-data log-likelihood and the maximum likelihood estimates of the parameters, which also allows the method to be extended to the case where missing data are present.

More information can be found at http://www.jmlr.org/papers/volume11/ilin10a/ilin10a.pdf.

PPCA in a nutshell

PCA can also be described as the maximum likelihood solution of a probabilistic latent variable model, which is known as PPCA:

\begin{equation*} Y_{ij} = \mathbf{P}_i \, \mathbf{Z}_j + m_i + \varepsilon_{ij} \quad i = 1, .., d; \, j = 1, .., n \end{equation*}

with

\begin{align} \mathbf{Z}_j \sim N(0, \mathbb{1}) \\ \varepsilon_{ij} \sim N(0, \nu) \end{align}

Both the principal components $Z$ and the noise $\varepsilon$ are assumed normally distributed. The model can be identified by finding the maximum likelihood (ML) estimates of the model parameters using the Expectation-Maximization (EM) algorithm, which minimizes the mean-square error over the observed part of the data. EM is a general framework for learning parameters with incomplete data, which iteratively updates the expected complete-data log-likelihood and the maximum likelihood estimates of the parameters. In PPCA, the data are incomplete because the principal components, $\mathbf{Z}_j$, are not observed and are treated as latent variables. When missing data are present, in the E-step the expectation of the complete-data log-likelihood is taken with respect to the conditional distribution of the latent variables given the observed variables. In this case, the EM update rules are the following.

  1. E-step: \begin{align} \mathbf{\Sigma}_{\mathbf{Z}_j} = \nu \left(\nu \, \mathbb{1} + \sum_i \mathbf{P}_i \mathbf{P}_i^T \right)^{-1} \\ \overline{\mathbf{Z}}_j = \dfrac{1}{\nu}\mathbf{\Sigma}_{\mathbf{Z}_j} \sum_i \mathbf{P}_i \left(Y_{ij}- m_i \right) \\ m_{i} = \dfrac{1}{n} \sum_j \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j \right) \\ \end{align}

  2. M-step: \begin{align} \mathbf{P}_{i} = \left( \sum_j \overline{\mathbf{Z}}_j \overline{\mathbf{Z}}_j^T + \sum_j \mathbf{\Sigma}_{\mathbf{Z}_j} \right)^{-1} \sum_j \overline{\mathbf{Z}}_j \, \left(Y_{ij} - m_{i} \right)\\ \nu = \dfrac{1}{nd} \sum_{ij} \left[ \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j - m_i \right)^2 + \mathbf{P}_i^T \, \mathbf{\Sigma}_{\mathbf{Z}_j} \mathbf{P}_i \right] \end{align}

where each row of $\mathbf{P}$ and $\overline{\mathbf{Z}}$ is recomputed based only on those columns of $\overline{\mathbf{Z}}$ which contribute to the reconstruction of the observed values in the corresponding row of the data matrix.
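The update rules above can be sketched in NumPy. For simplicity, instead of the exact conditional E-step over only the observed entries, this version re-imputes the missing entries from the current low-rank reconstruction at each iteration (a common approximation); function and variable names are mine, not from the workshop materials.

```python
import numpy as np

def ppca_impute(Y, n_components, n_iter=200, seed=0):
    """Simplified PPCA for data with missing entries (NaNs).

    Y: (n_samples, n_vars) array with NaNs where data are missing.
    Approximation: missing entries are re-imputed from the current
    reconstruction each iteration rather than marginalized exactly.
    Returns (scores, completed_Y).
    """
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float).copy()
    mask = np.isnan(Y)
    col_means = np.nanmean(Y, axis=0)
    Y[mask] = np.take(col_means, np.where(mask)[1])  # mean-impute to start

    n, d = Y.shape
    k = n_components
    P = rng.normal(size=(d, k))  # loadings, random init
    nu = 1.0                     # noise variance
    for _ in range(n_iter):
        m = Y.mean(axis=0)
        Yc = Y - m
        # E-step: posterior covariance and means of the latent scores
        Sigma_z = nu * np.linalg.inv(nu * np.eye(k) + P.T @ P)
        Z = Yc @ P @ Sigma_z.T / nu
        # M-step: update loadings and noise variance
        P = np.linalg.solve(Z.T @ Z + n * Sigma_z, Z.T @ Yc).T
        recon = Z @ P.T + m
        resid = Y - recon
        nu = (resid ** 2).mean() + np.trace(P @ Sigma_z @ P.T) / d
        # Re-impute the missing entries from the reconstruction
        Y[mask] = recon[mask]
    return Z, Y
```

Applied to the normalized ACS table, `Z` gives the PC scores used as covariates below and the second return value is the completed data matrix.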

Impute missing values and extract PC scores (linearly decorrelated)

Plot explained variance

First, we compute the explained variance as a function of the number of PCs retained, to decide how many PCs to keep. We decide to keep the PCs that together explain up to 80% of the variance (the black dashed line in the plot below), although this choice is somewhat arbitrary and might vary depending on the application.
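The 80% cut-off can be computed directly from the explained-variance ratios, e.g.:

```python
import numpy as np

def n_components_for(explained_var_ratio, threshold=0.80):
    """Smallest number of PCs whose cumulative explained variance
    reaches the threshold (80% by default)."""
    cumvar = np.cumsum(explained_var_ratio)
    return int(np.searchsorted(cumvar, threshold) + 1)
```

For example, with ratios `[0.5, 0.2, 0.15, 0.1, 0.05]` the function returns 3, since the first three PCs explain 85% of the variance.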

Plot PC correlations

To understand the relationship between the transformed variables (the PC scores) and the original variables, we can plot the 10 variables most highly correlated with each PC, as shown in the bar plot below. For example, we see that the first PC, which explains most of the variance, is positively correlated with the density of owner-occupied housing units but negatively correlated with the density of renter-occupied housing units, as the map below also shows.

Note: depending on the random seed on your computer, the signs of the correlations might change.

Map the first Principal Component

Select only the PCs that explain up to 80% of the variance (dimensionality reduction)

Finally, merge the ACS data and the store data, and group the store revenue data by block group

5. Train the model using the data for the existing stores

Having prepared the model covariates, we will model the annual revenues by store ($\mathbf{y}$) using a Generalized Linear Model (GLM) with the selected PC scores as covariates ($\mathbf{X}$):

\begin{equation*} E\big[\mathbf{y}\vert\mathbf{X}\big] = g^{-1}\Big(\sum_j X_j \, \beta_j \Big) \end{equation*}

where $E$ is the expectation of the observations $\mathbf{y}$ conditional on the covariates $\mathbf{X}$ and $g$ is a link function. Given the distribution of the annual revenues, we will use a Gamma likelihood with a logarithmic link function.

Looking at the model results, we notice a significant positive correlation with the first PC (which is higher in richer areas) and significant negative correlations with the second and third PCs, which are positively correlated with workforce density and with the density of female seniors, respectively.

Note: the individual correlations with the PCs, and therefore with the original variables, might depend on the random seed on your computer, but will be consistent overall.

To assess the model accuracy, we then plot the observed vs. the predicted values and compute the pseudo-$R^2$ score, which is defined as

\begin{equation*} 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \end{equation*}
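This score can be computed directly:

```python
import numpy as np

def pseudo_r2(y, y_hat):
    """1 - SS_res / SS_tot, as defined above."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot
```

A value near 1 indicates accurate predictions; 0 means the model is no better than predicting the mean.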

The plot below shows that most of the data points lie on the 1-1 line; together with the high value of the pseudo-$R^2$ score, this indicates that our model is accurate for in-sample predictions.

While GLM models typically do not overfit, we can also compare the pseudo-$R^2$ scores on the training and testing sets.

6. Get the predictions by Census block group

Finally, we can use this model to predict the annual revenue ($) for each block group in Long Island.

7. Visualize and share

With CARTOframes you can then share your visualizations by publishing your map using a url.

Upload data to CARTO

Visualize the predictions

Share your results

8. Can we improve our model by accounting for any spatial dependence in the residuals?

Having assessed the performance of the non-spatial GLM, we can check whether the model residuals ${y_i-\hat{y_i}}$ show any residual spatial dependence. If they do, better results might be obtained with a model that explicitly accounts for the fact that an observation is more correlated with observations collected at neighboring locations than with observations collected farther away.

Including a spatially-structured random effect that incorporates such spatial dependence is beneficial when the variables used as covariates may not be sufficient to explain the variability of the observations, so that the measurements, given the predictors, are not independent.

How to go spatial for spatially continuous processes?

A spatial stochastic process can be decomposed as:

\begin{equation*} y(\mathbf{s}) = \mu(\mathbf{s}) + w(\mathbf{s}) + \varepsilon \end{equation*}

where $\mu(\mathbf{s})$ is the mean function (deterministic and smooth), $\varepsilon$ is an independent process and

\begin{equation*} w(\mathbf{s}) \sim GP\left(0, C(\mathbf{s}, \mathbf{s'}, \mathbf{\theta})\right) \end{equation*}

where the covariance $C(\mathbf{s}, \mathbf{s'}, \mathbf{\theta})$ typically depends only on the Euclidean length of the spatial separation vector, $|\mathbf{h}| = |\mathbf{s} - \mathbf{s'}|$, and potentially on some hyperparameters $\mathbf{\theta}$. For example, the exponential covariance function is defined as:

\begin{equation*} C \left(|\mathbf{h}|\right) = \sigma^2 \exp\left(-\dfrac{|\mathbf{h}|}{\rho}\right) \end{equation*}

where $\rho$ is the range parameter, with the covariance between two points tending to zero as the points become farther apart in space. (The empirically derived definition of the range, $\rho = \dfrac{\sqrt{8 \nu}}{\kappa}$, comes from the Matérn family of covariance functions, of which the exponential is a special case.)
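The exponential covariance is straightforward to implement:

```python
import numpy as np

def exp_cov(h, sigma2, rho):
    """Exponential covariance C(|h|) = sigma^2 * exp(-|h| / rho).

    h: separation distance(s); sigma2: variance; rho: range parameter.
    """
    return sigma2 * np.exp(-np.abs(h) / rho)
```

At zero separation the covariance equals the variance $\sigma^2$, and it decays towards zero as the distance grows relative to $\rho$.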

Fitting spatial models comes with some caveats:

Variograms

Variograms represent a useful tool to check for spatial dependence in the residuals. The variogram is defined as:

\begin{equation*} 2\gamma(\mathbf{h}) := Var\left[y(\mathbf{s + h}) - y(\mathbf{s})\right] = 2 \left(C \left(0\right) - C \left(|\mathbf{h}|\right)\right)\end{equation*}

First, we will construct an empirical variogram by grouping the pairs of observations into bins based on their distance (assuming isotropy) and averaging the squared differences of the values over all pairs in each bin. Given a semivariogram model, we can then fit it to the empirical semivariogram and estimate the model parameters.
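A sketch of the empirical semivariogram computation described above (function name is mine):

```python
import numpy as np

def empirical_variogram(coords, values, n_bins=15):
    """Isotropic empirical semivariogram: average 0.5 * (z_i - z_j)^2
    over all pairs, grouped into distance bins.

    Returns (bin_centers, semivariance), with NaN for empty bins.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    sq = 0.5 * (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)  # each pair counted once
    d, g = dist[iu], sq[iu]
    edges = np.linspace(0.0, d.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    gamma = np.array([
        g[in_bin].mean() if in_bin.any() else np.nan
        for in_bin in ((d >= lo) & (d < hi)
                       for lo, hi in zip(edges[:-1], edges[1:]))
    ])
    return centers, gamma
```

Applied to the GLM residuals at the store locations, the resulting curve can then be fitted with a parametric semivariogram model.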

First, fit a variogram model on the residuals

The variogram model parameters are:

Then, let's fit a Gaussian Process GLM model

We will now extend the GLM to account for the residual spatial dependence by fitting this model:

\begin{equation*} E\big[\mathbf{y}\vert\mathbf{X}\big] = g^{-1}\,\Big(\sum_j X_j \, \beta_j + u(\mathbf{s}_i) + \varepsilon_i \Big) \end{equation*}

where $u(\mathbf{s}_i)$ represents the spatially structured residuals, modelled as a Gaussian Process with an exponential covariance function, and $\varepsilon_i$ an IID random term.

We will fit this model in a Bayesian framework using the programming language Stan (https://mc-stan.org). However, to reduce the model complexity we will fix the covariance parameters and use those derived from the variogram analysis (in a fully Bayesian approach these would be estimated).
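Separately from the Stan fit, the same idea can be sketched in closed form: with the covariance parameters fixed, the conditional (simple-kriging) mean of the spatial residual at new locations is a linear combination of the observed residuals. The function below is an illustrative sketch, not the workshop's Stan model:

```python
import numpy as np

def krige_residuals(coords_obs, resid, coords_new, sigma2, rho, nugget):
    """Simple-kriging mean of the GLM residuals under a fixed exponential
    covariance; sigma2, rho and nugget are assumed to come from the
    variogram fit."""
    coords_obs = np.asarray(coords_obs, dtype=float)
    coords_new = np.asarray(coords_new, dtype=float)
    resid = np.asarray(resid, dtype=float)

    def cov(a, b):
        d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
        return sigma2 * np.exp(-d / rho)

    # Covariance among observations (nugget on the diagonal) and
    # cross-covariance between prediction and observation locations
    K = cov(coords_obs, coords_obs) + nugget * np.eye(len(coords_obs))
    k_star = cov(coords_new, coords_obs)
    return k_star @ np.linalg.solve(K, resid)
```

Far from all observations the predicted residual shrinks to zero, recovering the non-spatial GLM prediction there.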

The Stan model is saved locally in _stan/GP_gamma_modelexact.stan.

Extract the posterior mean and standard deviations

Finally, check that we have removed any residual spatial autocorrelation

We now fit the variogram model to the residuals of this new model, using the posterior mean.

Compared to the previous variogram plot, we can see that adding a spatially structured random effect has substantially reduced the spatial dependence in the residuals (note the scale of the y-axis).

Next steps:

The model above shows that we have removed the spatial dependence in the residuals. The next step would be to use this model to make predictions of the annual revenues for each census block group. However, because the computation time for GPs scales cubically with the number of data locations, some approximations are needed, which go beyond the scope of this workshop. More information can be found here: